Data Summaries and Functions




Data Summaries

We've looked at a few graphical techniques for exploring data, and now we're going to turn to a numerical one. Consider the question "Which day of the week has the highest average box office for hit movies released on that day?" As a first step in answering that question, it would be helpful to look at the mean box office receipts for each of the days. If you look for a function to do that specific task, you probably won't find one, because R takes the more general approach of providing a function that will allow you to calculate anything you want from vectors of values broken down by groups. In fact, there are a variety of ways to do this. The one we're going to look at is called aggregate. You pass aggregate a vector or data frame containing the variables you want to summarize, a list of the groups to summarize by, and the function you'd like to use for your summaries. That way, a single function can perform many tasks, and, as we'll see when we learn to write functions, it even allows R to do things that the developers of R never imagined. For now, we'll stick to some built-in functions, like mean. To find the means for the box office receipts for each day of the week, we could use a call to aggregate like this:
> aggregate(movies$box,movies['weekday'],mean)
    weekday        x
1    Monday 126.7766
2   Tuesday 104.8419
3 Wednesday 127.1272
4  Thursday 104.0686
5    Friday 102.6522
6  Saturday  82.2441
7    Sunday 103.0268

The same thing could be done to calculate other statistics, like median, min, max, or any statistic that returns a single scalar value for each group. Another nice feature of aggregate is that if the first argument is a data frame, it will calculate the statistic for each column of the data frame. If we passed aggregate both the rank and box, we'd get two columns of summaries:
> aggregate(movies[,c('Rank','box')],movies['weekday'],mean)
    weekday     Rank      box
1    Monday 443.1538 126.7766
2   Tuesday 511.7037 104.8419
3 Wednesday 455.0116 127.1272
4  Thursday 560.5122 104.0686
5    Friday 520.0766 102.6522
6  Saturday 596.1000  82.2441
7    Sunday 497.6667 103.0268
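
The same pattern works with other summary functions. Here is a minimal sketch using median and max; since the movies data frame isn't reproduced here, it uses a small made-up data frame called toy, whose values are invented purely for illustration:

```r
# Toy stand-in for the movies data frame (invented values, for illustration only)
toy = data.frame(
    weekday = c('Friday', 'Friday', 'Friday', 'Wednesday', 'Wednesday'),
    box     = c(100, 120, 400, 90, 110)
)

# Passing a vector: the summary column is named x
med = aggregate(toy$box, toy['weekday'], median)
print(med)

# Passing a one-column data frame: the summary column keeps the name box
top = aggregate(toy['box'], toy['weekday'], max)
print(top)
```

Note that the median for Friday (120) is far below the Friday mean, because one large value pulls the mean up; trying more than one statistic is often worthwhile.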

To add a column of counts to the table, we can create a data frame from the table function, and merge it with the aggregated results:
> dat =  aggregate(movies[,c('Rank','box')],movies['weekday'],mean)
> cts = as.data.frame(table(movies$weekday))
> head(cts)
       Var1 Freq
1    Monday   13
2   Tuesday   27
3 Wednesday  172
4  Thursday   41
5    Friday  744
6  Saturday   10

To make the merge simpler, we rename the first column of cts to weekday.
> names(cts)[1] = 'weekday'
> res = merge(cts,dat)
> head(res)
   weekday Freq     Rank      box
1   Friday  744 520.0766 102.6522
2   Monday   13 443.1538 126.7766
3 Saturday   10 596.1000  82.2441
4   Sunday   12 497.6667 103.0268
5 Thursday   41 560.5122 104.0686
6  Tuesday   27 511.7037 104.8419

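Finally, we can put the rows back into day-of-the-week order. This works because weekday is a factor whose levels are stored in calendar order, so order follows the levels rather than the alphabet: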
> res[order(res$weekday),]
    weekday Freq     Rank      box
2    Monday   13 443.1538 126.7766
6   Tuesday   27 511.7037 104.8419
7 Wednesday  172 455.0116 127.1272
5  Thursday   41 560.5122 104.0686
1    Friday  744 520.0766 102.6522
3  Saturday   10 596.1000  82.2441
4    Sunday   12 497.6667 103.0268
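
How order treats a column depends on whether it holds plain character strings or a factor. The following sketch, using a hypothetical three-row table with a few of the counts from above, shows the difference:

```r
days = c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
         'Friday', 'Saturday', 'Sunday')

# Hypothetical summary table, stored with weekday as character strings
small = data.frame(weekday = c('Friday', 'Monday', 'Sunday'),
                   Freq    = c(744, 13, 12),
                   stringsAsFactors = FALSE)

# Character strings are ordered alphabetically
print(small[order(small$weekday), ])

# A factor is ordered by its levels, here the days in calendar order
small$weekday = factor(small$weekday, levels = days)
sorted = small[order(small$weekday), ]
print(sorted)
```

If order ever returns the days alphabetically, converting the column with factor and an explicit levels= argument, as above, is one way to fix it.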

Functions

As you've already noticed, functions play an important role in R. A very attractive feature of R is that you can write your own functions, and they work exactly the same as the ones that are part of the official R release. In fact, if you create a function with the same name as one that's already part of R, your version will mask the built-in function, and possibly cause problems. For that reason, it's a good idea to make sure that there's not already another function with the name you want to use. If you type the name you're thinking of, and R responds with a message like "object 'xyz' not found", you're probably safe.
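Besides typing the name at the prompt, the base function exists gives a yes-or-no answer to whether a name is already in use. A short sketch (the name my.candidate.name is just a made-up example of a name you might be considering):

```r
# TRUE: aggregate is already defined, so this name should be avoided
taken = exists('aggregate')
print(taken)

# FALSE in a fresh session, so this (hypothetical) name would be safe to use
free = !exists('my.candidate.name')
print(free)
```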
There are several reasons why creating your own functions is a good idea.
  1. If you find yourself writing the same code over and over again as you work on different problems, you can write a function that incorporates whatever it is you're doing and call the function, instead of rewriting the code over and over.
  2. All of the functions you create are saved in your workspace along with your data. So if you put the bulk of your work into functions that you create, R will automatically save them for you (if you tell R to save your workspace when you quit).
  3. It's very easy to write "wrappers" around existing functions to make a custom version that sets the arguments to another function to be just what you want. R provides a special mechanism to "pass along" any extra arguments the other function might need.
  4. You can pass your own functions to built-in R functions like aggregate, by, apply, sapply, lapply, mapply, sweep and other functions to efficiently and easily perform customized tasks.
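As a small illustration of the last point, here is a sketch of a user-written function being handed to aggregate and sapply. The function name spread and the data are invented for this example:

```r
# Hypothetical helper: difference between the largest and smallest value
spread = function(x) max(x) - min(x)

# Invented data, standing in for a real data frame
toy = data.frame(
    weekday = c('Friday', 'Friday', 'Wednesday', 'Wednesday'),
    box     = c(100, 400, 90, 110)
)

# Passed to aggregate exactly as a built-in such as mean would be
by_day = aggregate(toy['box'], toy['weekday'], spread)
print(by_day)

# Passed to sapply, which applies it to each element of a list
widths = sapply(list(a = c(1, 5), b = c(2, 2, 9)), spread)
print(widths)
```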
Before getting down to the details of writing your own functions, it's a good idea to understand how functions in R work. Every function in R has a set of arguments that it accepts. You can see the arguments that built-in functions take in a number of ways: viewing the help page, typing the name of the function in the interpreter, or using the args function. When you call a function, you can simply pass it arguments, in which case they must line up exactly with the way the function is designed, or you can specifically pass particular arguments in whatever order you like by providing them with names using the name=value syntax. You can also combine the two, passing unnamed arguments (which are matched, in order, to the arguments not supplied by name), together with named arguments in whatever order you like. For example, consider the function read.table. We can view its argument list with the command:
> args(read.table)
function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
    row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA",
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
    comment.char = "#", allowEscapes = FALSE, flush = FALSE,
    stringsAsFactors = default.stringsAsFactors(), encoding = "unknown")
NULL

This argument list tells us that, if we pass unnamed arguments to read.table, it will interpret the first as file, the next as header, then sep, and so on. Thus if we wanted to read the file my.data, with header set to TRUE and sep set to ',', any of the following calls would be equivalent:
read.table('my.data',TRUE,',')
read.table(sep=',',TRUE,file='my.data')
read.table(file='my.data',sep=',',header=TRUE)
read.table('my.data',sep=',',header=TRUE)
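
To check the equivalence for yourself, the sketch below writes a tiny comma-separated file to a temporary location (the contents are invented) and compares the results of the four call styles with identical:

```r
# Create a small temporary file to read (invented contents)
f = tempfile(fileext = '.csv')
writeLines(c('name,score', 'a,1', 'b,2'), f)

r1 = read.table(f, TRUE, ',')                        # all by position
r2 = read.table(sep = ',', TRUE, file = f)           # named first, leftover by position
r3 = read.table(file = f, sep = ',', header = TRUE)  # all by name
r4 = read.table(f, sep = ',', header = TRUE)         # file by position, rest by name

# TRUE if all four calls produced the same data frame
same = identical(r1, r2) && identical(r1, r3) && identical(r1, r4)
print(same)

unlink(f)  # remove the temporary file
```

In the second call, R first matches the named arguments sep and file, then assigns the unnamed TRUE to the first argument still unfilled, which is header.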

Notice that most of the arguments in the argument list for read.table have values after the name of the argument; these are the default values. The exceptions are file, row.names, and col.names. Of these, only file is actually required: read.table checks internally whether row.names and col.names were supplied, and works without them. All of the other arguments are optional, and if we don't specify them the default values that appear in the argument list will be used. Most R functions are written so that the first few arguments are the ones that will usually be used, so that their values can be entered without providing names, with the other arguments being optional. Optional arguments can be passed to a function by position, but are much more commonly passed using the name=value syntax, as in the last example of calling read.table.
Now let's take a look at the function read.csv. You may recall that this function simply calls read.table with a set of parameters that makes sense for reading comma separated files. Here's read.csv's function definition, produced by simply typing the function's name at the R prompt:
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)
<environment: namespace:utils>

Pay special attention to the three periods (...) in the argument list. Notice that they also appear in the call to read.table inside the function's body. The three dots stand for all the arguments that were passed to the function that didn't match any of the other arguments in the argument list. So if you pass anything other than file, header, sep, quote, dec, fill, or comment.char to read.csv, it will be part of the three dots; by putting the three dots at the end of the argument list in the call to read.table, all those unmatched arguments are simply passed along to read.table. So if you make a call to read.csv like this:
read.csv(filename,stringsAsFactors=FALSE)

the stringsAsFactors=FALSE will get passed to read.table, even though it wasn't explicitly named in the argument list. Without the three dots, R will not accept any arguments that aren't explicitly named in the argument list of the function definition. If you want to intercept the extra arguments yourself, you can include the three dots at the end of the argument list when you define your function, and create a list of those arguments inside the function body by referring to list(...).
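Both behaviors can be seen in a short sketch. The function names show_extras and no_dots are made up for this illustration:

```r
# Hypothetical function that reports which extra arguments it received
show_extras = function(x, ...) {
    extras = list(...)   # everything that did not match x
    names(extras)
}
got = show_extras(1, sep = ',', header = TRUE)
print(got)

# Hypothetical function without the three dots: extra arguments are an error
no_dots = function(x) x
outcome = tryCatch(no_dots(1, sep = ','),
                   error = function(e) 'rejected')
print(outcome)
```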
Suppose you want to create a function that will call read.csv with a filename, but which will automatically set the stringsAsFactors=FALSE parameter. For maximum flexibility, we'd want to be able to pass other arguments (like na.strings=, or quote=) to read.csv, so we'll include the three dots at the end of the argument list. We could name the function read.csv and overwrite the built-in version, but that's not a good idea, if for no other reason than the confusion it would cause if someone else tried to understand your programs! Suppose we call the function myread.csv. We could write a function definition as follows:
> myread.csv = function(file,stringsAsFactors=FALSE,...){
+    read.csv(file,stringsAsFactors=stringsAsFactors,...)
+ }

Now, we could simply use
thedata = myread.csv(filename)

to read a comma-separated file with stringsAsFactors=FALSE. You could still pass any of read.table's arguments to the function (including stringsAsFactors=TRUE if you wanted), and, if you ask R to save your workspace when you quit, the function will be available to you next time you start R in the same directory.
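The following sketch tries the wrapper on a small temporary file with invented contents. It repeats the definition so that it can be run on its own. (In R 4.0.0 and later, read.csv already defaults to stringsAsFactors=FALSE, so there the wrapper mainly illustrates the technique.)

```r
myread.csv = function(file, stringsAsFactors = FALSE, ...) {
    read.csv(file, stringsAsFactors = stringsAsFactors, ...)
}

# Small temporary file to read (invented contents)
f = tempfile(fileext = '.csv')
writeLines(c('name,score', 'a,1', 'b,missing'), f)

d1 = myread.csv(f)                           # strings stay as character
d2 = myread.csv(f, na.strings = 'missing')   # extra argument passed along via ...
d3 = myread.csv(f, stringsAsFactors = TRUE)  # default overridden by the caller

print(class(d1$name))
print(d2$score)
print(class(d3$name))

unlink(f)  # remove the temporary file
```

The second call shows the three dots at work: na.strings is not in myread.csv's argument list, yet it reaches read.table by way of read.csv.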